klotz: data quality* + data engineering*

  1. These one-liners provide quick and effective ways to assess the quality and consistency of the data within a Pandas DataFrame.

    Code Snippet                                               Explanation
    df.isnull().sum()                                          Counts the missing values in each column.
    df.duplicated().sum()                                      Counts the duplicate rows in the DataFrame.
    df.describe()                                              Provides basic descriptive statistics for the numeric columns.
    df.info()                                                  Displays a concise summary of the DataFrame, including data types and non-null counts.
    df.nunique()                                               Counts the unique values in each column.
    df.apply(lambda x: x.nunique() / x.count() * 100)          Computes the percentage of unique values among the non-null entries of each column.
    df.isin([value]).sum()                                     Counts the occurrences of a specific value in each column.
    df.applymap(lambda x: isinstance(x, type_to_check)).sum()  Counts the values of a specific type (e.g., int, str) in each column. applymap is deprecated since pandas 2.1 in favor of DataFrame.map.
    df.dtypes                                                  Lists the data type of each column in the DataFrame.
    df.sample(n)                                               Returns a random sample of n rows from the DataFrame.
  2. This article explains how to quickly detect data quality issues and identify their causes using Python for ETL pipelines. It discusses strategies to minimize the time required to fix data quality problems.
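The one-liners listed in item 1 can be tried on a small DataFrame. The following is a minimal sketch rather than code from either bookmarked article: the sample data, the column names, and the variable names are all invented for illustration. The type check uses `Series.map` inside `apply` so that it runs the same way on pandas versions before and after the `applymap` deprecation.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only.
df = pd.DataFrame(
    {
        "id": [1, 2, 2, 4],
        "city": ["Oslo", "Rome", "Rome", None],
        "score": [3.5, 4.0, 4.0, None],
    }
)

# Missing values per column.
missing_per_column = df.isnull().sum()

# Number of fully duplicated rows (the first occurrence is not counted).
duplicate_rows = int(df.duplicated().sum())

# Unique values per column (missing values are ignored by default).
unique_per_column = df.nunique()

# Percentage of unique values among the non-null entries of each column.
unique_pct = df.apply(lambda col: col.nunique() / col.count() * 100)

# Occurrences of one specific value in each column; isin expects a list-like.
rome_per_column = df.isin(["Rome"]).sum()

# Values of a given Python type per column, checked element by element.
str_per_column = df.apply(
    lambda col: col.map(lambda x: isinstance(x, str))
).sum()

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
print(unique_pct.round(1))
```

Each result except `duplicate_rows` is a Series indexed by column name, so a single column's figure can be read with, for example, `missing_per_column["city"]`.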
